Abstract

Smoothing methods and SiZer (Significant ZERo crossing of the derivatives) are useful
tools for exploring significant underlying structures in data samples. An extension of
SiZer to circular data, namely CSiZer, is introduced. Based on scale-space ideas, CSiZer
presents a graphical device to assess which observed features are statistically significant,
both for density and regression analysis with circular data. The method is intended for
analyzing the behavior of wind direction on the Atlantic coast of Galicia (NW Spain) and
how it influences wind speed. The performance of CSiZer is also checked with
some simulated examples.
1 Introduction



Coastal and marine ecosystems face a variety of threats from human and industrial activity, and
they are especially vulnerable to oil spills and toxic dumping. Specifically, the Atlantic coast of
Galicia (NW Spain) has suffered two major ship accidents that caused serious environmental and
ecological damage: the burning of a cargo ship named Cason in 1987, and the oil spill of the
Prestige tanker in 2002. In the first accident, strong winds displaced the cargo, and the corrosive,
toxic and flammable chemical products transported by Cason exploded and burned while the ship was
being handled at a dock. Also because of the highly variable and strong winds in the area during a
storm, the Prestige oil tanker sank off the Galician coast, causing the largest environmental
disaster on the Atlantic coast of the Iberian Peninsula, with a spill of more than seventy thousand
tonnes of fuel. Despite these serious accidents, this area is still on the route of most cargo
vessels and tankers sailing from the north of Europe to the Mediterranean Sea, Africa or America.
As shown in Figure 1, there is a marine traffic control zone, which regulates the sailing direction
and distance from the coast. A buoy anchored in the area (see Figure 1) provides hourly wind speed
and wind direction measurements, the latter being a set of circular data.

Circular data are data that can be represented as directions on a unit circle (see Jammalamadaka and
SenGupta, 2001, for an extensive review of this topic). This type of data arises quite frequently in
many natural and physical sciences, for instance in marine sciences, where the study of ocean
currents and winds is extremely important for marine operations involving navigation, search and
rescue at sea, or pollutant dispersion in the ocean.
Figure 1: Atlantic coast of Galicia (NW Spain). The plot shows the marine traffic control area (arrows
indicate the directions that ships must follow), within the influence area of two major lighthouses
(white lines). The buoy registering the data is located NE from the traffic control area at longitude
-0.21 0E and latitude 43.500N.
Within this context, the goal of this work is twofold. Firstly, from a practical point of view, the
main aim is to describe the wind pattern on the Galician coast during the winter season, focusing on
the most significant wind directions and their relation with wind speed. For that purpose, a
meteorological data set consisting of wind direction and wind speed measurements will be considered
(see Figure 2 for descriptive plots). Secondly, in order to achieve this goal, CSiZer, a new
exploratory tool based on nonparametric kernel density and regression estimators for circular data,
will be introduced.

Density estimation from a sample of circular data, as well as regression estimation when the
explanatory variable is circular, is an interesting statistical problem in a variety of applied
fields. From a nonparametric perspective, density and regression estimation can be approached by
using local smoothers based on kernel functions. Kernel density estimation for the general case of
spherical data was studied by Hall et al. (1987), and nonparametric kernel methods for regression
estimation with a circular explanatory variable and a linear response were introduced by Di Marzio
et al. (2009). As in any nonparametric procedure, kernel methods depend on a smoothing parameter
or bandwidth, which can be selected from the data or chosen by the researcher (see Oliveira et al.,
2012a, for a comparison of the existing bandwidth selectors for density estimation). The bandwidth
controls the global appearance of the estimator and its dependence on the sample. Given that an
unsuitable smoothing parameter may provide a misleading estimate of the density or regression curve,
the statistical significance of the features observed in the smoothed curve should be assessed so
that the conclusions drawn are not compromised.




Figure 2: Descriptive plots for wind direction (left, rose diagram) and wind speed (right, histogram and
kernel density estimator). Wind direction is measured in angles and represented on the circumference
in clockwise sense, starting from the N direction. Wind speed is measured in m/s.

The SiZer method, developed by Chaudhuri and Marron (1999) for linear data, provides a means of
circumventing smoothing parameter selection and, at the same time, allows for the assessment
of statistically significant features in the data structure. The original SiZer is a visualization
method based on nonparametric curve estimates. From a scale-space perspective, SiZer addresses the
question of which features observed in a smoothed curve are really present, or represent an
important underlying structure, and are not simply artifacts of the sampling noise. In the
nonparametric curve estimation context, the scale-space framework is given by a family of kernel
smoothers indexed by the bandwidth parameter. SiZer considers a wide range of bandwidths, which
avoids the problem of bandwidth selection, whilst peaks and troughs are identified by finding the
regions of significant gradient (zero crossings of the derivative). This information is presented
in a simple visual way by the SiZer map.

Several adaptations of SiZer have been proposed in the statistical literature, making it possible to
extend this graphical tool to a variety of contexts such as local likelihood (Li and Marron, 2005),
dependent data (Rondonotti et al., 2007) and survival data (Marron and de Uña-Álvarez, 2004), among
others. SiZer for linear variables has been successfully applied in many different scientific fields.
For example, Rudge (2008) uses this method to find peaks in geochemical distributions; Sonderegger et
al. (2009) consider SiZer to detect thresholds in ecological data; and Rydén (2010) applies SiZer to
determine a possible increasing trend in hurricane activity in the North Atlantic.

In the special setting of circular data, both for kernel density and regression estimation, the
adaptation of SiZer ideas must take into account the nature of the data. This particular scenario
involves, specifically: (1) the assessment of the variability in the derivatives of circular kernel
estimators, both for density and regression, through the computation of standard deviations and
appropriate quantiles; (2) the development of a suitable visualization device that helps the
practitioner interpret the output. Bearing these premises in mind, the SiZer ideas can be adapted to
the circular data setting, yielding the CSiZer plot presented in this work. The CSiZer plot is
produced using self-programmed code developed in the free software environment R (R Development
Core Team, 2012).

This paper is organized as follows. Section 2 provides a brief overview of kernel density estimation
for circular data, and of regression estimation for a circular explanatory variable and a linear
response. Section 3 is devoted to the introduction of the CSiZer plot, detailing its construction and
interpretation. The performance of the new CSiZer is illustrated with some simulated examples and
real data in Section 4, where CSiZer is used for describing the wind direction and the relation
between wind speed and wind direction on the Galician coast during the winter season. A brief
discussion of the proposal and some final comments are provided in Section 5.



2 Nonparametric curve estimation for circular data 



CSiZer will be based on nonparametric estimates of the target curve, density or regression. For this
purpose, this section provides a brief background on circular kernel density estimation and local
linear regression for a circular explanatory variable and a linear response. See Oliveira et al.
(2012b) for a comprehensive review of nonparametric methods for circular data.
2.1 Nonparametric circular density estimation

Given a random sample of angles $\Theta_1, \Theta_2, \ldots, \Theta_n \in [0, 2\pi)$ from some unknown
density $f$, the kernel circular density estimator of $f$, at an angle $\theta$, is defined as:

$$\hat f(\theta;\nu) = \frac{1}{n}\sum_{i=1}^{n} K_\nu(\theta - \Theta_i), \quad 0 \le \theta < 2\pi, \qquad (1)$$

where $K_\nu$ is a circular kernel function with concentration parameter $\nu > 0$ (see Di Marzio et
al., 2009). As a circular kernel, the von Mises density can be considered. Also known as the circular
Normal, the von Mises model, $vM(\mu,\kappa)$, is a symmetric unimodal distribution characterized by
a mean direction $\mu \in [0, 2\pi)$ and a concentration parameter $\kappa > 0$, with probability
density function

$$g(\theta;\mu,\kappa) = \frac{1}{2\pi I_0(\kappa)} \exp\{\kappa\cos(\theta-\mu)\}, \quad 0 \le \theta < 2\pi,$$

where $I_0$ denotes the modified Bessel function of the first kind and order 0, which is just a
normalizing constant ensuring the unit integral of the density. With this specific kernel, the
density estimator (1) is given by:

$$\hat f(\theta;\nu) = \frac{1}{n(2\pi) I_0(\nu)} \sum_{i=1}^{n} \exp\{\nu\cos(\theta-\Theta_i)\},$$

which is a mixture of von Mises distributions centered at $\Theta_i$ and with concentration parameter $\nu$.
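
As a minimal illustration of estimator (1) with the von Mises kernel, the following Python sketch evaluates the estimate at one angle for several values of $\nu$. It is not the R code used in this work: the function names `bessel_i0` and `circ_kde` and the simulated sample are hypothetical, and $I_0$ is approximated by its power series.

```python
import math
import random


def bessel_i0(x, terms=100):
    """Modified Bessel function of the first kind and order 0 (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def circ_kde(theta, sample, nu):
    """Kernel circular density estimate at angle theta with a von Mises kernel."""
    norm = len(sample) * 2.0 * math.pi * bessel_i0(nu)
    return sum(math.exp(nu * math.cos(theta - t)) for t in sample) / norm


random.seed(1)
# Hypothetical sample: 200 angles drawn from vM(pi, 4).
angles = [random.vonmisesvariate(math.pi, 4.0) for _ in range(200)]
for nu in (1.0, 10.0, 60.0):
    print(nu, round(circ_kde(math.pi, angles, nu), 3))
```

Because the kernel is a density on the circle, the estimate integrates to one over $[0, 2\pi)$ for any $\nu$, which is a convenient sanity check for an implementation.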

A critical issue when using this estimator in practice is the choice of the smoothing parameter $\nu$.
Large values of $\nu$ lead to highly variable (undersmoothed) estimators, whereas small values of
$\nu$ imply low concentration of the kernel around the observations, providing oversmoothed
estimators of the circular density. The effect of the smoothing parameter $\nu$ is illustrated in
Figure 3, where kernel density estimates for the wind direction distribution are shown. When a
midrange bandwidth is used, $\nu = 10$ (solid line), the estimate shows two modes, suggesting that
the wind comes mainly from NE and SW. However, these modes may disappear if the selected value of the
bandwidth is smaller ($\nu = 1$, dashed line). Also, many more modes, which are likely to be spurious
sampling artifacts, appear for larger bandwidths ($\nu = 60$, dotted line). A crucial issue is then
how to choose the bandwidth. There are several approaches to the problem of choosing the smoothing
parameter in this setting (see, e.g., Hall et al., 1987, and Oliveira et al., 2012a). Usually, the
bandwidth parameter is selected in order to minimize some error criterion, such as the mean
integrated squared error between the density estimate and the unknown true density. Although based on
nonparametric kernel circular density (and regression) estimation, the goal of this paper is to
identify which observed features are "really there", avoiding the selection of an "optimal"
bandwidth parameter.
Figure 3: Example of features revealed by smoothing in density estimation. Kernel circular density
estimates with three different bandwidths: $\nu = 1$ (dashed line), $\nu = 10$ (solid line) and
$\nu = 60$ (dotted line) for wind direction data in circular (left panel) and linear (right panel)
representations. Wind direction is represented over the circumference in clockwise sense, starting from N.

2.2 Nonparametric circular-linear regression estimation 

Let $\{(\Theta_i, Y_i),\ i = 1, \ldots, n\}$ be a random sample from $(\Theta, Y)$, a circular and a
linear random variable, respectively. The relation between these variables can be modeled by

$$Y_i = f(\Theta_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2)$$

where $f$ denotes now the regression function and $\varepsilon_i$ are real-valued random variables
with zero mean and variance $\sigma^2$. The local circular-linear regression estimates for
$f(\theta)$ and $f'(\theta)$ at an angle $\theta$ are given by $\hat f(\theta;\nu) = \hat a$ and
$\hat f'(\theta;\nu) = \hat b$, where

$$(\hat a, \hat b) = \arg\min_{(a,b)} \sum_{i=1}^{n} K_\nu(\theta - \Theta_i)\left[Y_i - (a + b\sin(\Theta_i - \theta))\right]^2 \qquad (3)$$

(see Di Marzio et al., 2009, for details). In equation (3), $\nu$ is the smoothing parameter and
$K_\nu$ is a circular kernel function; as for density estimation, a von Mises kernel with
concentration parameter $\nu$ is used throughout this work. With respect to the smoothing parameter,
large values of $\nu$ lead to undersmoothed estimates of the regression curve, exaggerating the local
features in the sample and tending to an interpolation of the data. On the other hand, small values
of $\nu$ result in a global averaging, oversmoothing the local characteristics in the data. This
effect can be checked on the real data example, as shown in Figure 4, when plotting the estimator of
the regression function for the wind speed (response), taking the wind direction as a covariate. A
small value of the smoothing parameter ($\nu = 1$, dashed line) provides an oversmoothed estimate,
indicating that there is no effect of the wind direction on the wind speed. However, for an
intermediate value ($\nu = 10$, solid line), wind speed is higher when wind comes from NE and S and
lower when coming from SE. These features are also shown by the larger bandwidth ($\nu = 60$, dotted
line) but, in this case, the estimator seems to be substantially undersmoothed.
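
The weighted least squares problem in (3) has a closed-form solution through its 2x2 normal equations. The Python sketch below is only an illustration under stated assumptions: the function name `local_linear` and the simulated data are hypothetical, the sine term is taken as $\sin(\Theta_i - \theta)$ so that the slope estimates $f'(\theta)$, and the kernel normalizing constant is dropped because it cancels in the solution.

```python
import math
import random


def local_linear(theta, angles, ys, nu):
    """Return (a, b): local estimates of f(theta) and f'(theta)."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for ang, y in zip(angles, ys):
        # von Mises weight; the normalizing constant cancels in the solution.
        w = math.exp(nu * (math.cos(theta - ang) - 1.0))
        s = math.sin(ang - theta)
        s0 += w
        s1 += w * s
        s2 += w * s * s
        t0 += w * y
        t1 += w * s * y
    det = s0 * s2 - s1 * s1
    a = (s2 * t0 - s1 * t1) / det
    b = (s0 * t1 - s1 * t0) / det
    return a, b


random.seed(2)
# Hypothetical data: Y = 2 + sin(Theta) + noise, so f(0) = 2 and f'(0) = 1.
angles = [random.uniform(0.0, 2.0 * math.pi) for _ in range(300)]
ys = [2.0 + math.sin(t) + random.gauss(0.0, 0.3) for t in angles]
a_hat, b_hat = local_linear(0.0, angles, ys, nu=10.0)
print(round(a_hat, 2), round(b_hat, 2))
```

Subtracting 1 inside the exponent only rescales all weights by the same factor, which avoids overflow for large $\nu$ without changing the fit.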
Figure 4: Example of features revealed by smoothing in regression estimation. Circular-linear
regression estimates, with three different bandwidths: $\nu = 1$ (dashed line), $\nu = 10$ (solid
line) and $\nu = 60$ (dotted line) for wind speed and wind direction data in circular (left panel)
and linear (right panel) representations. In the left plot, wind direction is represented over the
circumference in clockwise sense, starting from N, and wind speed is represented along the radius.

A simple and widely used procedure for bandwidth selection in the regression setting is
cross-validation, which pursues an "optimal" choice of the smoothing level (see Oliveira et al.,
2012b). As already mentioned for the kernel circular density estimation problem, the next section
proposes a method for exploring the different features that occur over a range of smoothing
parameter values, avoiding the problem of selecting a specific smoothing parameter.



3 CSiZer: SiZer map for circular data 

As noticed in the previous section, bandwidth selection is a critical issue for nonparametric density 
and regression estimations. Apart from the lack of a uniformly superior rule for that purpose, from a 
practical point of view, the exploration of the estimators at different smoothing degrees (for a range of 
reasonable bandwidth values, between oversmoothing and undersmoothing levels) will provide more 
in-depth information about the available data. However, significant features in the underlying data 
structure should be effectively disentangled from sampling artifacts. Features like peaks and valleys of 
a smooth curve can be characterized in terms of zero crossings of derivatives. Hence, the significance 
of such features can be judged from statistical significance of zero crossings or equivalently the sign 
changes of derivatives. This idea has been successfully exploited by Chaudhuri and Marron (1999) in
developing a simple yet effective tool called SiZer for exploring significant structures in density and 
regression curves. 

In the usual inferential approach in the statistical literature, the spotlight is placed on the true
underlying curve $f$ (the regression or the density function) and on doing inference on it, in
particular based on confidence bands. A crucial problem in nonparametric estimation is that
$f(\theta;\nu) = E(\hat f(\theta;\nu))$ is not necessarily equal to $f(\theta)$, involving an
inherent bias, especially for small values of $\nu$ (see Figure 5, left). The bias can be reduced by
taking large values of $\nu$, but in this case the estimator is highly variable, depending strongly
on the data sample (see Figure 5, right). Chaudhuri and Marron (1999) avoid the bias-variance
trade-off problem by adopting scale-space ideas, which naturally lead to making inference on the
smoothed curve $f(\cdot;\nu)$ rather than on the curve $f$. It should be noted that, for small
values of $\nu$, the smoothed curve $f(\cdot;\nu)$ can be very different from $f$. However, if $\nu$
is within a reasonable range, $f(\cdot;\nu)$, which can be thought of as the curve at resolution
level $\nu$, shows the same valley-peak structure as $f$ (see Figure 5, center).
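
For the density case, the object of inference can be made explicit. The identity below is a standard property of kernel estimators, stated here under the assumption that the observations are identically distributed with density $f$:

```latex
f(\theta;\nu) = E\big(\hat f(\theta;\nu)\big)
  = \frac{1}{n}\sum_{i=1}^{n} E\big(K_\nu(\theta-\Theta_i)\big)
  = \int_{0}^{2\pi} K_\nu(\theta-\varphi)\, f(\varphi)\, d\varphi .
```

The scale-space curve is therefore the true density smoothed by the kernel: a weakly concentrated kernel (small $\nu$) flattens peaks and valleys, which is the bias discussed above.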




Figure 5: Left and center: true density, a mixture of two von Mises distributions in the same
proportion, $vM(\pi/2, 4)$ and $vM(3\pi/2, 4)$ (solid line), and nonparametric density estimators
for the mixture from 100 random samples of size 250 (gray curves) with $\nu = 0.2$ and $\nu = 2$,
respectively. Right: nonparametric regression estimators (gray curves) for the sine model (solid
line) from 100 random samples of size 250 with $\nu = 50$.

Thus, in order to assess the significance of features such as peaks and valleys, instead of
constructing confidence intervals for $f'(\theta)$, SiZer seeks confidence intervals for the
scale-space version $f'(\theta;\nu) = E(\hat f'(\theta;\nu))$. As usual, confidence limits are of the form

$$\hat f'(\theta;\nu) \pm q \cdot \widehat{sd}(\hat f'(\theta;\nu)), \qquad (4)$$

where $q$ is an appropriate quantile and $\widehat{sd}$ is the estimated standard deviation (details
on its computation are given below).

So, at each pair $(\theta,\nu)$, if 0 lies in the corresponding interval, the slope of the smoothed
curve is not significant, whereas intervals lying entirely above or below 0 indicate increasing and
decreasing trends, respectively. The behaviour at $\theta$ and $\nu$ is presented via the CSiZer
color map, as discussed in Section 3.2.

In the linear case, Chaudhuri and Marron (1999) suggested several methods for the approximation
of the quantile $q$, including pointwise and simultaneous Gaussian quantiles and also bootstrap
quantiles. In our setting, accurate intervals without imposing Gaussian assumptions in (4) can be
obtained by bootstrap (see Efron and Tibshirani, 1993). A possible way to get such intervals, namely
the "bootstrap-t" approach, is detailed below. Given a significance level $\alpha$, for a fixed value
of $\nu > 0$ and with $\theta$ varying in the interval $[0, 2\pi)$, the following algorithm is considered:
Step 1. Generate $B$ bootstrap samples, i.e., random samples drawn with replacement from the data.
Step 2. For each bootstrap sample, compute

$$Z^* = \frac{\hat f'(\theta;\nu)^* - \hat f'(\theta;\nu)}{\widehat{sd}(\hat f'(\theta;\nu)^*)},$$

where $\hat f'(\theta;\nu)^*$ is the value of $\hat f'(\theta;\nu)$ for the bootstrap sample and
$\widehat{sd}(\hat f'(\theta;\nu)^*)$ is an estimator of the standard deviation of
$\hat f'(\theta;\nu)^*$ (see Section 3.1 for its calculation).

Step 3. Based on $Z^*_1, \ldots, Z^*_B$, compute the $\alpha$ and $(1-\alpha)$ sample quantiles,
$\hat t^{(\alpha)}$ and $\hat t^{(1-\alpha)}$, respectively.

Step 4. The "bootstrap-t" confidence interval is given by

$$\left(\hat f'(\theta;\nu) - \hat t^{(1-\alpha)} \cdot \widehat{sd}(\hat f'(\theta;\nu)),\ \ \hat f'(\theta;\nu) - \hat t^{(\alpha)} \cdot \widehat{sd}(\hat f'(\theta;\nu))\right),$$

where $\widehat{sd}(\hat f'(\theta;\nu))$ is an estimator of the standard deviation of
$\hat f'(\theta;\nu)$ (see Section 3.1 for its calculation).
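
The four steps above can be sketched generically in Python. This is an illustrative helper, not the implementation used in this work: `bootstrap_t_interval`, `quantile`, `mean` and `sd_of_mean` are hypothetical names, and for brevity the statistic in the demonstration is a sample mean, whereas in CSiZer it would be the derivative estimate at a given pair $(\theta,\nu)$ together with its standard deviation estimate.

```python
import math
import random


def quantile(values, p):
    """Simple empirical quantile (order statistic) of a list."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, int(math.ceil(p * len(ordered))) - 1))
    return ordered[idx]


def bootstrap_t_interval(sample, stat, sd, alpha=0.05, B=500, rng=random):
    """Bootstrap-t interval; stat and sd are callables taking a sample."""
    est, est_sd = stat(sample), sd(sample)
    z = []
    for _ in range(B):
        boot = [rng.choice(sample) for _ in sample]          # Step 1
        z.append((stat(boot) - est) / sd(boot))              # Step 2
    t_lo, t_hi = quantile(z, alpha), quantile(z, 1.0 - alpha)  # Step 3
    return est - t_hi * est_sd, est - t_lo * est_sd          # Step 4


def mean(xs):
    return sum(xs) / len(xs)


def sd_of_mean(xs):
    """Estimated standard deviation of the sample mean."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1) / len(xs))


random.seed(3)
data = [random.gauss(1.0, 2.0) for _ in range(100)]
low, high = bootstrap_t_interval(data, mean, sd_of_mean)
print(round(low, 3), round(high, 3))
```

Note that the upper quantile of the standardized replicates enters the lower limit and vice versa, exactly as in Step 4.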

3.1 Estimation of the standard deviation 

For the computation of confidence intervals, it is necessary to derive an expression for
$\widehat{sd}(\hat f'(\theta;\nu))$ (and also for its bootstrap version
$\widehat{sd}(\hat f'(\theta;\nu)^*)$, involved in the standardization procedure in Step 2 of
the previous algorithm). The main idea behind the calculation, in the context of density estimation,
is that the derivative estimator $\hat f'(\theta;\nu)$ is a weighted average of the derivative of the
kernel function at different locations. Specifically, for the problem of density estimation and
following Chaudhuri and Marron (1999), our proposal is to estimate the variance of
$\hat f'(\theta;\nu)$ by

$$\widehat{var}(\hat f'(\theta;\nu)) = \widehat{var}\left(n^{-1}\sum_{i=1}^{n} K'_\nu(\theta-\Theta_i)\right) = n^{-1} s^2\left(K'_\nu(\theta-\Theta_1), \ldots, K'_\nu(\theta-\Theta_n)\right), \quad 0 \le \theta < 2\pi,$$

where $s^2$ is the usual sample variance of $n$ data, which in this context are the values of the
derivative of the kernel centered at each sample value $\Theta_i$, with $i = 1, \ldots, n$.
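
A minimal Python sketch of this variance estimate for the von Mises kernel is given below. The names `bessel_i0` and `deriv_and_sd` and the simulated sample are hypothetical; the sketch relies on the fact that the von Mises kernel satisfies $K'_\nu(t) = -\nu \sin(t) K_\nu(t)$.

```python
import math
import random


def bessel_i0(x, terms=100):
    """Modified Bessel function of the first kind and order 0 (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def deriv_and_sd(theta, sample, nu):
    """Derivative estimate of the density at theta and its estimated sd."""
    n = len(sample)
    c = nu / (2.0 * math.pi * bessel_i0(nu))
    # K'_nu(t) = -nu * sin(t) * K_nu(t) for the von Mises kernel.
    vals = [-c * math.sin(theta - t) * math.exp(nu * math.cos(theta - t))
            for t in sample]
    est = sum(vals) / n
    s2 = sum((v - est) ** 2 for v in vals) / (n - 1)  # usual sample variance
    return est, math.sqrt(s2 / n)


random.seed(4)
# Hypothetical sample from vM(pi, 4): the density rises before pi and falls after.
angles = [random.vonmisesvariate(math.pi, 4.0) for _ in range(200)]
for theta in (math.pi / 2.0, 3.0 * math.pi / 2.0):
    est, sd = deriv_and_sd(theta, angles, nu=2.0)
    print(round(theta, 2), round(est, 3), round(sd, 3))
```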

In the regression setting, the derivative estimator is given by $\hat f'(\theta;\nu) = \hat b$, see
(3). It can be shown (see, for instance, Wasserman (2006), p. 77) that $\hat f'(\theta;\nu)$ can be
written as

$$\hat f'(\theta;\nu) = \frac{1}{n}\sum_{i=1}^{n} W_\nu(\theta,\Theta_i)\, Y_i,$$

for certain weights $W_\nu(\theta,\Theta_i)$, which can be easily computed from the kernel $K_\nu$.
So, the variance of $\hat f'(\theta;\nu)$ is given by:

$$var(\hat f'(\theta;\nu)) = var\left(n^{-1}\sum_{i=1}^{n} W_\nu(\theta,\Theta_i)\, Y_i \,\Big|\, \Theta_1, \ldots, \Theta_n\right) = n^{-2}\sum_{i=1}^{n} W_\nu(\theta,\Theta_i)^2\, \sigma^2(Y_i|\Theta_i).$$



A crucial problem is how to estimate the conditional variance $\sigma^2(Y_i|\Theta_i)$. When the
covariate is linear, Chaudhuri and Marron (1999) estimate this quantity by smoothing the residuals
using the same "linear" bandwidth as the one used to calculate the estimator. When the covariate is
circular, the bandwidth used to compute the estimator is devised for a circular framework, whereas
the residuals are linear. Hence, it does not seem reasonable to smooth the residuals using the same
bandwidth $\nu$, which comes from the circular setting. In order to avoid the calculation of a new
bandwidth for smoothing the residuals, the standard deviation will be approximated by bootstrap. For
a given $\nu > 0$ and with $\theta$ varying in $[0, 2\pi)$, the standard deviation of
$\hat f'(\theta;\nu)$ is estimated following the next steps:

1. Generate $B$ bootstrap samples, each one consisting of $n$ data values drawn with replacement
from the observed sample $\{(\Theta_i, Y_i);\ i = 1, \ldots, n\}$.

2. For each bootstrap sample, calculate $\hat f'^{*b}(\theta;\nu)$, with $b = 1, \ldots, B$.

3. Estimate the standard deviation by the sample standard deviation of the $B$ replicates:

$$\widehat{sd}(\hat f'(\theta;\nu)) = \left[s^2\left(\hat f'^{*1}(\theta;\nu), \ldots, \hat f'^{*B}(\theta;\nu)\right)\right]^{1/2}.$$

3.2 Reading CSiZer

As noted above, with CSiZer, significant features in the data are sought via the construction
of confidence intervals for the scale-space version of the smoothed derivative curve. Although the
procedure for obtaining these intervals must be carefully adapted for circular data involved in
density estimation and introduced as covariates in regression estimation with a linear response, as
shown throughout this section, the interpretation of the output through the CSiZer map is fairly simple.

Recall that, for a given pair $(\theta,\nu)$, the curve at smoothing level $\nu$ is significantly
increasing (decreasing) if the confidence interval is above (below) 0, and if the confidence interval
contains 0, the curve at smoothing level $\nu$ and at the point $\theta$ does not have a
statistically significant slope. This information can be displayed in a circular color map in such a
way that, at a given $\nu$, the behaviour of the estimated curve is represented by a color ring with
radius proportional to $\nu$. Different colors make it possible to identify peaks and valleys.

Blue (black, for black and white versions) indicates locations where $f(\theta;\nu)$ is
significantly increasing; red (dark gray) shows where it is significantly decreasing; and purple
(gray) indicates where the slope is not significantly different from zero. There is also a fourth
color, gray (light gray), corresponding to those regions where there are not enough data to make
statements about significance.
Thus, at a given bandwidth, a significant peak can be identified when a region of significant positive
gradient is followed by a region of significant negative gradient (i.e. blue-red pattern), and a
significant trough by the reverse (red-blue pattern), taking clockwise as the positive sense of
rotation. In Section 4, some examples of CSiZer maps with simulated and real data are shown.

To determine the gray areas (not enough data), the estimated effective sample size (ESS) is
calculated for each $(\theta,\nu)$. Following Chaudhuri and Marron (1999), regions where
$ESS(\theta,\nu) < 5$ are shaded gray.
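
The colour coding of a single pair $(\theta,\nu)$ can be sketched in Python as follows. The function names are hypothetical, and the effective sample size is computed under an assumption: the circular analogue of the linear SiZer definition, $ESS(\theta,\nu) = \sum_{i=1}^{n} K_\nu(\theta-\Theta_i)/K_\nu(0)$, which for the von Mises kernel reduces to a sum of $\exp\{\nu(\cos(\theta-\Theta_i)-1)\}$.

```python
import math


def effective_sample_size(theta, sample, nu):
    """Assumed ESS: sum of K_nu(theta - Theta_i) / K_nu(0), von Mises kernel."""
    return sum(math.exp(nu * (math.cos(theta - t) - 1.0)) for t in sample)


def csizer_colour(lower, upper, ess, min_ess=5.0):
    """Map a confidence interval for the derivative to a CSiZer colour label."""
    if ess < min_ess:
        return "gray"      # not enough data
    if lower > 0.0:
        return "blue"      # significantly increasing
    if upper < 0.0:
        return "red"       # significantly decreasing
    return "purple"        # 0 inside the interval: slope not significant


# Hypothetical intervals for three locations on the same ring.
print(csizer_colour(0.02, 0.10, ess=40.0))
print(csizer_colour(-0.03, 0.04, ess=40.0))
print(csizer_colour(-0.20, -0.05, ess=2.5))
```

The sparse-data check is applied first, so a narrow interval computed from very few observations is never reported as significant.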



4 Examples and real data analysis 

In this section, the performance of CSiZer for density and regression is illustrated with some
simulated examples and a real dataset. The corresponding CSiZer maps were obtained from
self-programmed code in R (see R Development Core Team, 2012), which is available as supplementary
material.

For density estimation, the performance of CSiZer has been studied in some simulated scenarios,
which highlight how CSiZer displays the information available in the data. Several circular
distributions have been considered: von Mises (D1), mixture of two and four von Mises (D2, D4) and
mixture of two wrapped Cauchy and two wrapped skew-normal distributions (D3). See Figure 6 (left
column) and Figure 7 (top row) for density plots and Oliveira et al. (2012a) for specific formulae.
Throughout this section, statistical significance is assessed with a significance level $\alpha = 0.05$.

Figure 6 (right column) presents CSiZer maps for densities D1, D2 and D3, with random samples of
size $n = 200$. The CSiZer maps can be easily obtained by calling the function
csizer.density(x, NU, ngrid, alpha, B, type). The arguments of this function are x, the angle data
sample; NU, a grid of positive smoothing parameters; and ngrid, an integer indicating the number of
equally spaced angles between 0 and $2\pi$ where the estimator is evaluated (default ngrid=250). A
significance level alpha can also be fixed (default alpha=0.05), as well as the number of bootstrap
samples B used to estimate the standard deviation of $\hat f'(\theta;\nu)$ (default B=500). Finally,
type is a number indicating the labels that appear in the plot: 1 (directions), 2 (hours), 3 (angles
in radians) or 4 (angles in degrees). The default is type=3. It is clear that the CSiZer maps show
the significance of the unimodal, bimodal and four-mode structure for each density, respectively.
Taking clockwise as the positive sense of rotation, Figure 6 (top-right) displays a blue area
followed by a red area for a wide range of bandwidths, indicating a significant increase and then
decrease, i.e., unimodality. In Figure 6 (center-right), the bimodal structure is clearly brought
out by the CSiZer map, as the two peaks and the troughs can be identified by the clockwise
blue-red-blue-red pattern on the map that occurs for a range of bandwidths between $\nu = 1$ and
$\nu = 31$. In Figure 6 (bottom-right), it can be seen that only two modes are identified for values
of the smoothing parameter smaller than $\nu = 10$ but, for larger values of this parameter, the
four-mode structure is obvious.
Figure 6: CSiZer maps (right column) for kernel density estimates based on simulated data from
densities D1, D2 and D3 (left column). Sample size $n = 200$. For reading CSiZer, take clockwise
sense of rotation. Values of $\nu$ are indicated along the radius.



The effect of increasing the sample size $n$ in model D4 can be seen in Figure 7. For $n = 200$, the
CSiZer map detects only two significant modes (see Figure 7, bottom-left). However, the underlying
three modes are significant for $n = 500$ (Figure 7, bottom-right).

Figure 7: CSiZer maps for kernel density estimates based on simulated data with sample size
$n = 200$ (bottom-left) and $n = 500$ (bottom-right) from density D4 (top row). For reading CSiZer,
take clockwise sense of rotation. Values of $\nu$ are indicated along the radius.

For illustrating the performance of CSiZer in estimating a regression model such as (2), the
regression function displayed in Figure 8 (left panel) has been considered. This is the same model
already analyzed by Di Marzio et al. (2009) and the fourth model in the illustration of kernel
estimators presented by Oliveira et al. (2012b). A sample of 200 observations from model (2), with
normally distributed errors with variance $\sigma^2 = 0.5$, has been generated in order to produce
the CSiZer map.

In the regression setting, the CSiZer map can be obtained by calling the function
csizer.regression(x, y, NU, ngrid, alpha, B, B2, type). The first arguments of this function are x,
the circular covariate values, and y, the linear response vector. The arguments NU, ngrid, alpha, B
and type are the same as for the density case, with the same default values for alpha, B and type.
The default for ngrid is 150. For the regression CSiZer, a further argument B2 is required. B2 is the
number of bootstrap samples used to compute the denominator in Step 2 of the algorithm (default
B2=250). From Figure 8 (left panel), it is clear that the regression model to estimate may present
some challenges, given the highly peaked mode centered at $\pi$ and another less concentrated mode
at about $7\pi/4$. Nevertheless, as can be seen in the CSiZer map presented in Figure 8 (right
panel), the two modes are identified as significant along the range of bandwidths.
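
To illustrate what the bootstrap replicates are used for in the regression case, the following Python sketch estimates the standard deviation of the local slope at one angle by resampling the pairs $(\Theta_i, Y_i)$, as described in Section 3.1. It is a sketch under assumptions, not the R function described above: `local_slope` and `bootstrap_sd_slope` are hypothetical names, the data are simulated, and the slope uses $\sin(\Theta_i - \theta)$ with von Mises weights.

```python
import math
import random


def local_slope(theta, pairs, nu):
    """Slope b of the local circular-linear fit at theta (von Mises weights)."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for ang, y in pairs:
        w = math.exp(nu * (math.cos(theta - ang) - 1.0))
        s = math.sin(ang - theta)
        s0, s1, s2 = s0 + w, s1 + w * s, s2 + w * s * s
        t0, t1 = t0 + w * y, t1 + w * s * y
    return (s0 * t1 - s1 * t0) / (s0 * s2 - s1 * s1)


def bootstrap_sd_slope(theta, pairs, nu, B=250, rng=random):
    """Sample standard deviation of B bootstrap replicates of the slope."""
    reps = []
    for _ in range(B):
        boot = [rng.choice(pairs) for _ in pairs]  # resample (Theta_i, Y_i) pairs
        reps.append(local_slope(theta, boot, nu))
    mean_rep = sum(reps) / B
    return math.sqrt(sum((r - mean_rep) ** 2 for r in reps) / (B - 1))


random.seed(6)
# Hypothetical data: Y = 2 + sin(Theta) + noise.
pairs = []
for _ in range(200):
    t = random.uniform(0.0, 2.0 * math.pi)
    pairs.append((t, 2.0 + math.sin(t) + random.gauss(0.0, 0.3)))
print(round(bootstrap_sd_slope(math.pi, pairs, nu=10.0), 3))
```

When this estimate is needed inside each bootstrap-t replicate, the resampling is nested, so the total cost grows with the product of the two bootstrap sizes; this is presumably why a smaller default is used for the inner loop.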






Figure 8: CSiZer map for kernel regression estimate based on simulated data with sample size
$n = 200$ from model (2). The regression function is plotted in the left panel. For reading CSiZer,
take clockwise sense of rotation. Values of $\nu$ are indicated along the radius.



4.1 Exploring wind patterns using CSiZer 

The practical usefulness of the proposed CSiZer map is illustrated by the analysis of a real dataset
concerning wind direction and speed on the Atlantic coast of Galicia (NW Spain). Meteorological
and oceanographic variables related to wind and currents behaviour are collected by a standard buoy
(model SeaWatch). With a diameter of 1.8 m and a height of 6.5 m, the buoy is anchored at the
location specified in Figure 1, far away from the coastline so that the measurements are not
influenced by local effects. Wind measurements regarding direction and speed are recorded every ten
minutes, and hourly averaged, at a height of 3 m above sea level. Data can be freely downloaded from
the Spanish Port Authority (Puertos del Estado, http://www.puertos.es).

The dataset consists of hourly observations of wind direction (in degrees) and wind speed (in m/s) in
the winter season (from November to February), from 2003 until 2012. For the circular
representation, as in previous plots, wind direction is marked over the circumference clockwise,
starting from N. In order to avoid the dependence between consecutive measurements in the time
series, the autocorrelation functions were studied. Observations taken with a lag period of 95 hours
can be considered uncorrelated, providing a final dataset with about 200 values. With this lag
period, all hours of the day are represented in the sample.
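
The last remark follows from simple arithmetic: 95 and 24 share no common factor, so observations taken every 95 hours cycle through all 24 hours of the day. A short Python check, with a hypothetical helper `thin` for the subsampling itself:

```python
from math import gcd

LAG_HOURS = 95


def thin(series, lag=LAG_HOURS):
    """Keep every lag-th observation of an hourly series."""
    return series[::lag]


# Hours of the day (0-23) hit by observations taken every 95 hours.
hours_visited = {(k * LAG_HOURS) % 24 for k in range(24)}
print(gcd(LAG_HOURS, 24), len(hours_visited))  # prints "1 24"
```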

Figure 9 shows the CSiZer map for wind directions (left plot) and the CSiZer map for regression
(right plot), applied for exploring the relation between wind speed as a response and wind direction
as a covariate. In Figure 9 (left plot), the two significant modes that can be distinguished for a
wide range of bandwidths indicate that winds in the winter period come mostly from NE and SW. Winds
from SE are not frequent at all, which is reflected by the absence of data in the SE sector (gray
shaded area). In addition, it can also be seen that wind speed increases when the wind comes from NE
and S (Figure 9, right plot).
Figure 9: CSiZer map for kernel density estimator (left) for wind direction and CSiZer map for
circular-linear regression (right) for wind speed (m/s) with respect to wind direction. For reading
CSiZer, take clockwise sense of rotation. Values of $\nu$ are indicated along the radius.



5 Final comments



An extension of SiZer to circular data, the CSiZer, both for density and regression, has been proposed.
The performance of CSiZer has been checked with some simulated examples and it has also been
applied to analyze wind patterns on the Galician coast during the winter season. In order to
effectively produce the CSiZer map, the assessment of the variability in the derivatives of the
circular kernel estimators, both for density and regression, is approached through the computation
of standard deviations and appropriate quantiles by bootstrap methods. Despite the technical details
behind the CSiZer derivation, possibly overwhelming for a practitioner, the graphical output allows
for an easy and useful interpretation.

As mentioned in the introduction, the SiZer technique has been adapted to other settings. Although
most of the previous works, and also the proposal presented in this paper, consider smoothers based
on kernels, the technique could be adapted to other types of smoothers such as splines (Marron and
Zhang, 2005). The same extension could be possible for circular data, although suitable
modifications would be needed in order to account for the periodic nature of the data.

It should be noted that circular data are just a particular case of spherical data (data on the
q-dimensional sphere). In principle, the methodology presented in Section 3 could be extended to
higher dimensions. Nevertheless, the lack of a simple visualization device would certainly hamper
the practical use of CSiZer for general dimension.

Finally, self-programmed code has been implemented for applying the proposed methods in practice. 
This code, developed in R (R Development Core Team, 2012), is available as supplementary material. 